We examine the distribution of humidity because it will serve as our response variable. Humidity shows the most natural variation of any variable measured in Santa Barbara, and it has implications for climate change, epidemiology, electronics, and other major aspects of modern life.

Exploratory Data Analysis

Before anything else, we must load all of our packages. You can see the full list of packages I am using below.

library(tidymodels)
library(ISLR)
library(ISLR2)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR)
library(dplyr)
library(ggplot2)
library(glmnet)
library(parsnip)
library(janitor)
library(corrplot)
library(randomForest)
library(rpart)
library(rpart.plot)
library(regclass)
library(broomstick)



Reading in Data

Now that that’s out of the way, it’s time to read in my data. I used clean_names() from the janitor package to make the data easier to wrangle. Before reading in this CSV, I spent a lot of time cleaning it in Excel: there were about four weeks during which no data was recorded at all. I simply removed those rows, since they were of no use and made up only a small fraction of my 2,530 remaining observations. I also make sure R reads my date values as dates rather than characters, and add a year_month variable in order to calculate monthly averages. This makes some of my data, such as precipitation, temperature, and humidity, far easier to plot.

weather <- read.csv("weatherdatanew.csv") %>% clean_names()
# We clean the variable names to make the code easier to write
weather$date <- as.Date(weather$date, format="%m/%d/%y")
# Cleaning date values
weather <- weather %>% mutate(year_month=format(date, "%y/%m"), .before=date)
# Adding year_month variable in order to create monthly averages for weather values



Correlation Matrix

Now it’s time to start my EDA. Below, I created a correlation matrix of all of my numeric variables in order to highlight important relationships between the predictors and the response.

weather %>% 
  dplyr::select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color')

# Correlation matrix of all numeric variables in order to ascertain which correlations stand out



The correlation matrix has some interesting implications. While some of the relationships were anticipated, such as the correlation between average wind speed and maximum wind speed, others were less obvious. In particular, the moderate negative correlation between average humidity and maximum wind speed and the strong positive correlation between average dew point and average humidity stood out to me. On reflection, these relationships make sense. High winds mix and disperse moist air, spreading water molecules farther apart and making condensation harder, so humidity tends to be lower on windy days. Average dew point is related to humidity because the dew point is the air temperature at which relative humidity would reach 100% and water would begin to condense into dew. If humidity is already high, the air does not need to cool as far for relative humidity to reach 100%, so the dew point is higher.
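The humidity-dew point relationship can also be sketched numerically. Below is a quick illustration using the Magnus approximation; the constants 17.625 and 243.04 are one commonly used parameter set, and this calculation is separate from the weather data above:

```r
# Magnus approximation: dew point (deg C) from temperature (deg C)
# and relative humidity (%). Constants a and b are one common fit.
dew_point_c <- function(temp_c, rh_pct, a = 17.625, b = 243.04) {
  alpha <- log(rh_pct / 100) + a * temp_c / (b + temp_c)
  b * alpha / (a - alpha)
}

dew_point_c(20, 100)  # at 100% humidity the dew point equals the air temperature: 20
dew_point_c(20, 50)   # drier air must cool much further before dew forms
```

At a fixed temperature, higher humidity pushes the dew point up toward the air temperature, which is exactly the positive correlation seen in the matrix.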

Exploring Individual Variables

I’d like to find out if there are any patterns in my data, so I’ve plotted several line graphs below, each exploring one of the variables.

Average Monthly Humidity

humidity_avg <- aggregate(average_humidity ~ year_month, weather, mean)
# Aggregate means of monthly humidity in a new data frame
h <- ggplot(data=humidity_avg, aes(x=year_month, y=average_humidity, group=1)) + 
  geom_line() + labs(y="Average Humidity", x="Year/Month")  
# Store plot for monthly average humidity
h + theme(axis.text.x = element_text(angle = 90))

# Show plot while rotating x-axis values 90 degrees to prevent crowding



While this graph is somewhat rough, it still makes sense: humidity tends to peak in the warmest months, as increased sunlight and heat cause more water to evaporate from the ocean and transpire from plants on land.

Average Monthly Temperature/Dew Point

temp_avg <- aggregate(average_temperature_o_f ~ year_month, weather, mean)
# Aggregate means of monthly temp in a new data frame
dew_avg <- aggregate(average_dew_point_o_f ~ year_month, weather, mean)
# Aggregate means of monthly dew point in a new data frame

total1 <- merge(humidity_avg, temp_avg, by="year_month")
total <- merge(total1, dew_avg, by="year_month")
ggplot(data=total, aes(x=year_month)) + 
  geom_line(aes(y = average_temperature_o_f, group=1, color="Temperature")) + 
  geom_line(aes(y = average_dew_point_o_f, group=1, color="Dew Point")) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Degrees Fahrenheit", x="Year/Month")



As expected, monthly average temperature and dew point track each other closely. As temperatures increase, the air holds more water vapor, so the temperature at which that water condenses rises as well. This is consistent with the humidity graph, which also peaks in the warmer months.



Total Monthly Precipitation

precip_month <- aggregate(precipitation_in ~ year_month, weather, sum)

ggplot(data=precip_month, aes(x=year_month)) + geom_line(aes(y=precipitation_in, group=1)) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Total Precipitation", x="Year/Month")



This graph is hardly surprising. Precipitation peaks regularly between November and April every year, with the height of the peaks varying depending on whether a drought is in effect and on other environmental factors.



Daily Average Pressure

ggplot(data=weather, aes(x=date)) + geom_point(aes(y=average_pressure_in_hg, group=1)) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(y="Average Pressure (inHg)", x="Date")



Since Santa Barbara Airport is very close to sea level, pressure stays very stable in the area. Still, some shallow peaks and troughs are visible in this scatterplot. Notably, four points stand out as outliers well below the rest of the data; I imagine these are recording errors, or days of exceptionally low pressure ahead of a storm. A few other points sit slightly above or below the main band, but they seem to be legitimate high- or low-pressure days rather than recording errors.

Daily Average and Maximum Wind Speed

ggplot(data=weather, aes(x=date)) + geom_line(aes(y=average_wind_speed_mph, group=1, color="Average Wind Speed")) +
  geom_line(aes(y=maximum_wind_speed_mph, group=1, color="Maximum Wind Speed")) +
  theme(axis.text.x = element_text(angle=90)) + labs(y="Miles per Hour", x="Date")



Interestingly, average and maximum wind speed both dip every year around December/January. I’m not sure exactly why this is, but I believe it has to do with the fact that air pressure tends to stay low around that time of year, so there is not much movement of air.



Modeling

Now it’s time to model the data. I’m choosing four models: a regression decision tree, a random forest with bagging, a boosted trees model, and a linear regression model.

Initial split

Since my dataset is relatively small (2530 observations), I’m choosing a 75-25 training-testing split and only creating 5 cross-validation folds with 5 repeats. I’m stratifying on the outcome variable as well.

set.seed(3945)
weather_split <- initial_split(weather, prop=0.75, strata=average_humidity)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)
weather_fold <- vfold_cv(weather_train, v=5, strata=average_humidity, repeats=5)



Recipe creation

Date variables are tiresome to integrate into our models, since they do not fit neatly into numeric values, so we leave date and year_month out when creating our recipe.

weather_recipe <- recipe(average_humidity ~ average_temperature_o_f + average_dew_point_o_f +
                           maximum_wind_speed_mph + average_wind_speed_mph +
                           average_pressure_in_hg + precipitation_in,
                         data=weather_train)



Model creation



Regression Tree



First we fit a regression decision tree. We are tuning cost_complexity and tree_depth, then choosing the best model out of a tuning grid with 5 levels. After selecting the best decision tree, we will plot it so we can visualize it.

set.seed(3945)
regtreespec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

regtreewf <- workflow() %>%
  add_model(regtreespec %>% set_args(cost_complexity = tune(), tree_depth = tune())) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)

param_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

tune_res <- tune_grid(
  regtreewf,
  resamples = weather_fold, 
  grid = param_grid
)

bestcomplexity <- select_best(tune_res, metric = "rmse")

regtreefinal <- finalize_workflow(regtreewf, bestcomplexity)

regtreefinalfit <- fit(regtreefinal, data = weather_train)

getcp(extract_fit_engine(regtreefinalfit))

regtreefinalpruned <- prune(extract_fit_engine(regtreefinalfit), cp=2.225653e-05)

save(regtreespec, regtreewf, param_grid, tune_res, bestcomplexity, regtreefinal, regtreefinalfit, regtreefinalpruned,  file="regtree_models.RData")



I used the getcp() function to find the cp value that yields the lowest cross-validation error, then pruned the tree with it. Now we can look at our decision tree: it’s extremely large, with dozens of nodes.

A zoomed-out screenshot of the overall decision tree

This looks like it might be a bit overfit, but we’ll see how it performs when we predict on the testing data. A detailed image of the whole tree is available at the link in DecisionTree.txt. Note that the file is about 1.09 GB, so it will take some time to load, and you will need to zoom in to view the whole tree. Notably, the tree splits on only two predictors: average_temperature_o_f and average_dew_point_o_f. Evidently these two are the most relevant to humidity, with the other predictors contributing little.
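Rather than zooming through the image, you can confirm which predictors a fitted rpart tree actually splits on by inspecting its frame component: rows whose var is "&lt;leaf&gt;" are terminal nodes, and everything else is a predictor in use. A minimal, self-contained sketch on a built-in dataset (mtcars here is purely illustrative; for the model above you would inspect extract_fit_engine(regtreefinalfit) instead):

```r
library(rpart)

# Fit a small regression tree on a built-in dataset for illustration
fit <- rpart(mpg ~ ., data = mtcars)

# frame$var holds the splitting variable for each node; "<leaf>"
# marks terminal nodes, so everything else is a predictor in use
used <- unique(setdiff(as.character(fit$frame$var), "<leaf>"))
used
```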

Let’s look at how well this tree predicts the testing set.

weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>% 
  rmse(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.56



Our tree is surprisingly accurate! Using the cp value provided by getcp(), we find a root mean square error of 2.56 percentage points. Considering that humidity in the data ranges from 27.4% to 99.5%, this is a fairly small typical error.
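To put that error in context, we can express the RMSE as a fraction of the observed humidity range (both numbers quoted from the text above, not re-computed):

```r
# Reported test RMSE and observed humidity range from the text above
rmse_tree      <- 2.56
humidity_range <- 99.5 - 27.4

# RMSE as a share of the full range of the response
round(rmse_tree / humidity_range, 3)
```

A typical error of roughly 3.6% of the response’s full range supports the claim that the tree is reasonably accurate.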

We can also see a scatterplot of the actual data points compared to the predicted values:

weather_test %>%
  mutate(
    .pred=predict(regtreefinalpruned, weather_test)
  ) %>%
  ggplot(aes(average_humidity, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)



The points follow the line quite well, and they hug it more closely as humidity increases; the error grows as humidity decreases. This is likely because Santa Barbara sits on the coastline, so humidity is quite stable in the 70-90% range. Most of the data points fall in that range, giving the model more data with which to predict humidity there. Lower humidity levels are comparatively rare, so much more “noise” appears in the predictions for them.

Random Forest with Bagging



Next we will fit a Random Forest model with bagging. Setting mtry to 6, the number of predictors, means every split considers all predictors, which makes the random forest equivalent to bagged trees.

set.seed(3945)

bagging_spec <- rand_forest(mtry = 6) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

baggingwf <- workflow() %>%
  add_model(bagging_spec %>% set_args(min_n = tune())) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)

param_grid_bag <- grid_regular(min_n(), levels = 5)

tune_res_bag <- tune_grid(
  baggingwf,
  resamples = weather_fold, 
  grid = param_grid_bag
)

bestcomplexitybag <- select_best(tune_res_bag, metric = "rmse")

bagfinal <- finalize_workflow(baggingwf, bestcomplexitybag)

bagfinalfit <- fit(bagfinal, data = weather_train)



Now let’s see how this model performs on the testing set.

augment(bagfinalfit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.45



The RMSE for this model is significantly lower than for the Decision Tree model: 1.45% compared to 2.56%, a full 1.11 percentage points lower.

Boosted Tree

Now we fit a Boosted Tree model. We set trees to 5000, tune learn_rate over a log-scale grid, and use the xgboost engine.

set.seed(3945)

boost_spec <- boost_tree(trees = 5000, learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boostwf <- workflow() %>%
  add_model(boost_spec) %>%
  add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)
  
param_grid_boost <- grid_regular(learn_rate(range=c(-10, -1)), levels = 5)

tune_res_boost <- tune_grid(
  boostwf,
  resamples = weather_fold, 
  grid = param_grid_boost
)

bestcomplexityboost <- select_best(tune_res_boost, metric = "rmse")

boostfinal <- finalize_workflow(boostwf, bestcomplexityboost)

boostfinalfit <- fit(boostfinal, data = weather_train)



Let’s see how the Boosted Tree model performs on the testing data.

augment(boostfinalfit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.39



Our RMSE for this model is 1.39%, a bit lower than it was for the Random Forest model! This is 1.17% less than our Decision Tree RMSE.

Linear Regression

Finally, we fit a Linear Regression model. This is straightforward compared to the other models.

set.seed(3945)

lm_model <- linear_reg() %>% 
  set_engine("lm")

lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(weather_recipe)

lm_fit <- fit(lm_wflow, weather_train)



Now let’s see the performance of the Linear Regression model on the testing set.

augment(lm_fit, new_data = weather_test) %>%
  rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.26

The RMSE for this model is 2.26% - though this is certainly on the higher side, it’s still 0.3 percentage points lower than the Decision Tree model’s.

Finally, let’s compare all four models.
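As a starting point for that comparison, the test-set RMSEs reported in the sections above can be collected into a single table (the values below are simply the numbers quoted earlier, not a re-computation):

```r
# Collect the test RMSEs reported in the sections above
results <- data.frame(
  model     = c("Decision Tree", "Bagged Random Forest",
                "Boosted Tree", "Linear Regression"),
  test_rmse = c(2.56, 1.45, 1.39, 2.26)
)

# Order from best (lowest RMSE) to worst
results[order(results$test_rmse), ]
```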